This analysis explores a dataset containing chemical attributes and quality ratings for approximately 4,900 white variants of the Portuguese “Vinho Verde” wine. The objective is to determine which physicochemical properties affect wine quality.
The following describes the attributes of the wine, taken from Cortez et al., 2009 [1].
1 - fixed acidity (tartaric acid - g/dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity (acetic acid - g/dm^3): the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
3 - citric acid (g/dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar (g/dm^3): the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
5 - chlorides (sodium chloride - g/dm^3): the amount of salt in the wine
6 - free sulfur dioxide (mg/dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide (mg/dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density (g/cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale
10 - sulphates (potassium sulphate - g/dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial agent and antioxidant
11 - alcohol (% by volume): the percent alcohol content of the wine
12 - quality: score between 0 (very bad) and 10 (very excellent)
Let’s first look at the structure of the wine dataframe.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The wine dataset consists of 12 variables and 4,898 observations once X, a sequential count for each observation, is removed. Now let’s have a look at the summary statistics (the five-number summary plus the mean) for the variables in the wine dataset.
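The loading step described above could look roughly like the sketch below. The function name `load_wine` and the file name in the example call are assumptions, since the report does not show its loading code.

```r
# Sketch: load the white wine CSV and drop the row-counter column X.
# `load_wine` is a hypothetical helper, not code from the report.
load_wine <- function(path) {
  wine <- read.csv(path)
  # X only numbers the rows, so it carries no chemical information
  wine$X <- NULL
  wine
}

# Example call (the file name is a placeholder):
# wine <- load_wine("white_wine.csv")
# str(wine)      # column types and first values
# summary(wine)  # min, quartiles, median, mean and max per column
```

A CSV written with row names but no header for them is read back by `read.csv` with that column named `X`, which is one common way such a counter column ends up in a data frame.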
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
Quality is the output variable, so we are interested in how the input variables affect it.
We can investigate the distribution of white wine quality by plotting a bar graph. Quality is scored from 0 to 10, where 0 is very bad and 10 is very excellent. From the summary above, we can see that the minimum quality in this dataset is 3 while the maximum is 9.
The quality scores for white wine are approximately normally distributed: the mean (5.88) and median (6) are close to one another.
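A minimal sketch of how the quality distribution could be tabulated and drawn is given below. It uses base R graphics to stay dependency-free, which may differ from the plotting library used for the report's figures, and `quality_summary` is a hypothetical helper name.

```r
# Sketch: count wines per quality score and compare mean and median.
quality_summary <- function(quality) {
  list(
    counts = table(quality),   # number of wines at each score
    mean   = mean(quality),
    median = median(quality)
  )
}

# Example call, assuming a data frame `wine` with a `quality` column:
# qs <- quality_summary(wine$quality)
# barplot(qs$counts, xlab = "quality", ylab = "number of wines")
```

Because quality takes only a handful of integer values, a bar graph of counts is a more faithful picture than a histogram with arbitrary bins.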
Let’s plot a bar graph for the quality of wine and categorize the quality of white wine as follows:
- Bad: 0 - 4
- Average: 5 - 6
- Excellent: 7 - 10
A column named ‘level’ was added to the wine dataframe containing the wine quality levels: bad, average and excellent.
## 'data.frame': 4898 obs. of 13 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ level : chr "average" "average" "average" "average" ...
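The ‘level’ column could be derived from quality with nested `ifelse` calls, as in the sketch below. The helper name `add_level` is an assumption; the cut-offs are the ones stated above.

```r
# Sketch: map the integer quality score to the three levels
# (bad: 0-4, average: 5-6, excellent: 7-10).
add_level <- function(wine) {
  wine$level <- ifelse(wine$quality <= 4, "bad",
                ifelse(wine$quality <= 6, "average", "excellent"))
  wine
}

# Example call, assuming a data frame `wine` with a `quality` column:
# wine <- add_level(wine)
# table(wine$level)
```

`ifelse` returns a character vector here, which matches the `chr` type of the level column in the structure output.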
The majority of the wines are of average quality; there are few bad quality wines in this dataset.
Let’s now look at histograms for the input variables.
Let’s do some transformations on the residual sugar.
Now let’s transform the alcohol data.
When we transform the residual sugar using a log scale for the x-axis, the graph clearly shows a bi-modal distribution. After transforming the x-axis for the alcohol data, the graph is now somewhat closer to a rectangular distribution or may even display a tri-modal distribution.
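One way to reproduce this kind of transformation is to take log10 of the values before binning, which gives the same shape as a log-scaled x-axis. The sketch below uses base R; `log_hist` and its `bins` argument are hypothetical names.

```r
# Sketch: histogram of a strictly positive variable on a log10 scale.
log_hist <- function(x, bins = 30, ...) {
  stopifnot(all(x > 0))   # log10 is undefined for zero or negative values
  # `breaks` is treated by hist() as a suggested number of bins
  hist(log10(x), breaks = bins, ...)
}

# Example calls, assuming a data frame `wine`:
# log_hist(wine$residual.sugar, xlab = "log10(residual sugar)", main = "")
# log_hist(wine$alcohol, xlab = "log10(alcohol)", main = "")
```

The positivity check matters for this dataset because some columns, such as citric acid, contain zeros and cannot be log-transformed directly.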
The wine dataset originally consisted of 12 variables (X is a sequential count for each observation, hence it was removed) with 4,898 observations. Eleven of these variables are input variables that are physicochemical properties of the wine. There is one output variable: quality.
A new column was added to the wine dataset called ‘level’ which was a categorical measure of wine quality. Wines considered ‘bad’ quality had quality ratings between 0 and 4, ‘average’ wine quality had quality ratings between 5 and 6 and wines with a quality rating between 7 and 10 were considered to be ‘excellent’ quality.
The main feature of interest is quality (the output variable). Quality is based on sensory properties (such as taste, smell and sight), and these properties are affected by the physicochemical properties of the wine (the input variables).
Most input variables had a normal distribution. It is hard to say which variables will have an impact on the wine quality at this stage in the investigation.
The input variables that were presumed to affect the wine quality are:
- citric acid, as this affects the taste/flavor of wine
- residual sugar, as this affects wine sweetness
- pH, as it is crucial to the taste of wine
- alcohol, as this may influence wine quality
- chlorides, as it is the amount of salt in the wine
A ‘level’ variable, based on the wine quality variable, was created.
Both the residual sugar and alcohol data displayed unusual behaviour. These two variables were transformed using a log scale for the x-axis to better understand the distribution. After the transformation, the residual sugar displayed a bi-modal distribution. After transforming the x-axis for the alcohol data, the distribution appeared to be closer to a normal distribution or perhaps even a bi-modal distribution.
A bar graph was also plotted using the quality levels and it was found that most wines were of ‘average’ quality with very few ‘bad’ and ‘excellent’ quality wines. It may be challenging to build a predictive model with this data, as our sample mainly consists of ‘average’ quality wine with very few ‘bad’ and ‘excellent’ wines.
Let’s first analyse the correlation coefficients between the variables using a scatterplot matrix:
We can see the correlation coefficients between the variables in the scatterplot matrix, but it would be better to view them with the strongest correlations emphasized using a colour scheme. This was done below.
It is not immediately clear from the scatterplot matrix which correlations are the strongest. The correlation matrix plot, however, uses colour to show at a glance which correlation coefficients are highest and lowest.
Here we see that a few variables are negatively correlated with wine quality:
- density (-0.31)
- chlorides (-0.21)
- volatile acidity (-0.19)
- total sulfur dioxide (-0.18)
- fixed acidity (-0.11)
- residual sugar (-0.1)
The following variables are positively correlated with wine quality:
- alcohol (0.44)
- pH (0.1)
- sulphates (0.05)
The other correlations that stand out from the correlation matrix are:
- residual sugar and density (0.84)
- alcohol and density (-0.78)
- free sulfur dioxide and total sulfur dioxide (0.62)
- density and total sulfur dioxide (0.53)
- alcohol and residual sugar (-0.45)
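The coefficients listed above can also be read off numerically rather than from the plot. The sketch below computes Pearson's r between quality and every other numeric column with base R's `cor`; `quality_correlations` is a hypothetical helper name.

```r
# Sketch: Pearson correlation of each numeric column with quality,
# sorted from most negative to most positive.
quality_correlations <- function(wine) {
  num <- wine[sapply(wine, is.numeric)]   # keep numeric columns only
  r <- cor(num)[, "quality"]              # column of the correlation matrix
  r <- r[names(r) != "quality"]           # drop the trivial self-correlation
  round(sort(r), 2)
}

# Example call, assuming a data frame `wine`:
# quality_correlations(wine)
```

Sorting the vector makes the strongest negative and positive relationships easy to spot without relying on the colour scale.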
Let’s investigate how alcohol, density, chlorides and total sulfur dioxide affect wine quality, using boxplots to investigate these trends.
For bad quality wine, the alcohol content is just over 10 % by volume, then for average quality of 5, the alcohol content decreases and then increases from quality 6 to excellent quality (7 - 9).
The median density is fairly steady for bad quality wines (just under 0.995), then increases slightly for wine quality level of 5 and then gradually decreases as the quality level increases from average to excellent.
Chlorides seem to decrease steadily as quality increases.
Here we see that the median for the free sulfur dioxide is lower for bad quality wines (~ 20 mg/dm^3) and fairly consistent for average and excellent quality wines (30 - 35 mg/dm^3). The total sulfur dioxide is lower for bad quality wine, slightly higher for average quality and then decreases again for excellent quality wine. The free sulfur dioxide is responsible for preserving the wine, thus perhaps a wine with a lower free sulfur dioxide content may result in a wine that is not as fresh. Excessive amounts of total sulfur dioxide can inhibit fermentation as well as cause undesirable sensory flavour.
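The medians read from the boxplots can be checked numerically. The sketch below groups a chosen variable by quality score using base R; `median_by_quality` is a hypothetical helper name.

```r
# Sketch: median of one input variable at each quality score.
median_by_quality <- function(wine, variable) {
  tapply(wine[[variable]], wine$quality, median)
}

# Example calls, assuming a data frame `wine`:
# median_by_quality(wine, "alcohol")
# median_by_quality(wine, "free.sulfur.dioxide")
# boxplot(alcohol ~ quality, data = wine)   # base R version of the boxplot
```

Comparing the returned medians side by side makes it easier to confirm trends such as the dip in alcohol content at quality 5.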
Let’s also have a look at the input variables that were strongly correlated using scatterplots. First we will look at the relationship between residual sugar and density.
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$residual.sugar
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
The residual sugar is very strongly positively correlated with density, with a correlation coefficient of 0.84. A red linear trend-line is plotted for the data.
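The t statistic in the test output follows directly from the correlation coefficient and the sample size, t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. The sketch below recomputes it from the reported values; `cor_t_stat` is a hypothetical helper name.

```r
# Sketch: t statistic for testing whether a Pearson correlation is zero.
cor_t_stat <- function(r, n) {
  df <- n - 2                      # degrees of freedom (4896 for 4898 wines)
  r * sqrt(df) / sqrt(1 - r^2)
}

# Reported r for density vs residual sugar, n = 4898 observations
cor_t_stat(0.8389665, 4898)   # close to the t = 107.87 reported above
```

With nearly 4,900 observations even a modest correlation yields a very large t value, which is why every p-value in these tests is reported as below 2.2e-16.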
Let’s now look at a scatterplot for alcohol and density.
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
Alcohol is strongly negatively correlated with density, with a correlation coefficient of -0.78.
Next let’s investigate the relationship between free sulfur dioxide and total sulfur dioxide.
##
## Pearson's product-moment correlation
##
## data: wine$total.sulfur.dioxide and wine$free.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5977994 0.6326026
## sample estimates:
## cor
## 0.615501
There is a strong, positive correlation between free sulfur dioxide and total sulfur dioxide, with a correlation coefficient of 0.62. This is because free sulfur dioxide constitutes a portion of total sulfur dioxide, with the remainder being bound sulfur dioxide.
total SO2 = free SO2 + bound SO2 [5]
The free SO2 portion (not associated with wine molecules) essentially acts as a buffer against microbes and oxidation. The bound SO2 portion (sulfites bound to molecules such as sugars, acetaldehyde or phenolic compounds), on the other hand, has already done its work and is no longer useful as a preservative [5]. Sulfur dioxide levels need to be carefully regulated: excess sulfur dioxide not only results in an unpleasant taste, it is also an allergen and can be harmful to people.
Next we investigated the relationship of density and total sulfur dioxide.
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$total.sulfur.dioxide
## t = 43.719, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5094349 0.5497297
## sample estimates:
## cor
## 0.5298813
There is a moderate positive correlation between density and total sulfur dioxide, with a correlation coefficient of 0.53.
Finally the relationship between alcohol and residual sugar was investigated.
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$residual.sugar
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4726723 -0.4280267
## sample estimates:
## cor
## -0.4506312
There is a moderate negative correlation between alcohol and residual sugar, with a correlation coefficient of -0.45.
The correlation matrix plot showed that the following variables are positively correlated with wine quality:
- alcohol (0.44)
- pH (0.1)
- sulphates (0.05)
Here we saw that alcohol had the biggest effect on wine quality, with higher quality wines having a higher alcohol content.
The following variables were shown by the correlation matrix to be negatively correlated with wine quality:
- density (-0.31)
- chlorides (-0.21)
- volatile acidity (-0.19)
- total sulfur dioxide (-0.18)
- fixed acidity (-0.11)
- residual sugar (-0.1)
The other correlations that stand out from the correlation matrix and the scatterplot investigation are:
- residual sugar and density (0.84)
- alcohol and density (-0.78)
- free sulfur dioxide and total sulfur dioxide (0.62)
- density and total sulfur dioxide (0.53)
- alcohol and residual sugar (-0.45)
The strongest relationship was that between residual sugar and density: the higher the residual sugar (the sugar remaining after fermentation), the higher the density. This makes sense: when solid sugar is mixed with water, it dissolves and becomes part of the sugar-water solution, increasing the density of the solution as more sugar is added.
Alcohol had the greatest effect on wine quality, with better quality wines having a higher alcohol content.
Both residual sugar and alcohol are strongly correlated with density. Thus these three variables will be investigated.
Holding residual sugar constant, we can see that as the alcohol content increases, the density decreases. This may be because wine is denser than ethanol. The density of ethanol is 0.789 g/cm^3 at 20 degrees Celsius [8], while the density of white wine is around 0.98 g/cm^3 [9]. The higher the alcohol volume percentage, the lower the volume percentage of the rest of the wine, so the density of the solution decreases because ethanol is less dense than white wine.
Next we are going to analyze how residual sugar and chlorides affect density across quality. A new variable ‘density_bucket’ was created to do this analysis.
Chlorides are measured as sodium chloride (NaCl), in other words the amount of salt in the wine. Here we can see that, for the majority of the data at each quality level, the higher the chlorides in the wine, the higher the density. When sodium chloride dissolves, it adds mass to the solution with comparatively little increase in volume, so the density increases. We also see that as the quality of the wine increases, the amount of chlorides decreases gradually.
The residual sugar is the amount of sugar left over after fermentation, as the fermentation process consumes sugars to create ethanol, with carbon dioxide as a by-product. From the graph above, the higher the residual sugar content in the wine, the higher the density of the wine. The same argument applies as for the sodium chloride above, although the trend between residual sugar and density is much clearer than that between chlorides and density.
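The report does not show how ‘density_bucket’ was built, so the sketch below is only one plausible construction: it splits density into quantile-based intervals with `cut`. The helper name `add_density_bucket` and the choice of four buckets are assumptions.

```r
# Sketch: bin density into buckets holding a similar number of wines.
# Assumes the quantile break points are distinct, which cut() requires.
add_density_bucket <- function(wine, n_buckets = 4) {
  probs  <- seq(0, 1, length.out = n_buckets + 1)
  breaks <- quantile(wine$density, probs = probs)
  # include.lowest keeps the minimum density inside the first bucket
  wine$density_bucket <- cut(wine$density, breaks = breaks,
                             include.lowest = TRUE)
  wine
}

# Example call, assuming a data frame `wine`:
# wine <- add_density_bucket(wine)
# table(wine$density_bucket)
```

Quantile breaks are used here rather than equal-width intervals because a few very dense wines would otherwise leave most buckets nearly empty.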
Alcohol content and density were highly correlated with each other (Pearson’s r of -0.78), and these variables were also the most correlated with quality. Thus alcohol content and density will be investigated together with the ‘level’ variable created earlier, which is the wine quality level of ‘bad’, ‘average’ or ‘excellent’.
The slope for the excellent quality wine is the steepest. This means that for the same change in density, the change in alcohol content will be greater than for the average and bad quality wines.
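The slopes being compared can be estimated by fitting a separate straight line of alcohol against density within each quality level. This is a sketch under the assumption that the trend lines in the figure are ordinary least-squares fits; `slope_by_level` is a hypothetical helper name.

```r
# Sketch: least-squares slope of alcohol on density for each level.
# Expects columns `alcohol`, `density` and `level` in the data frame.
slope_by_level <- function(wine) {
  sapply(split(wine, wine$level), function(d) {
    coef(lm(alcohol ~ density, data = d))[["density"]]
  })
}

# Example call, assuming `wine` already has a `level` column:
# slope_by_level(wine)
```

A more negative value means alcohol content falls faster per unit increase in density, which is what a steeper line in the plot corresponds to.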
Let’s add contour lines onto the figure above and remove the scatter points.
Let’s investigate the plot above by using a 2D density plot. In the contour plot we can’t see the average quality layer very well; it should be more visible in the following plot.
These two plots seem to show the same relationship.
Lastly, a linear model for quality was created.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wine)
## m2: lm(formula = I(quality) ~ I(alcohol) + density, data = wine)
## m3: lm(formula = I(quality) ~ I(alcohol) + density + chlorides, data = wine)
## m4: lm(formula = I(quality) ~ I(alcohol) + density + chlorides +
## volatile.acidity, data = wine)
## m5: lm(formula = I(quality) ~ I(alcohol) + density + chlorides +
## volatile.acidity + total.sulfur.dioxide, data = wine)
##
## ==============================================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------------------------------
## (Intercept) 2.582*** -22.492*** -21.150*** -35.573*** -30.759***
## (0.098) (6.165) (6.162) (6.010) (6.295)
## I(alcohol) 0.313*** 0.360*** 0.343*** 0.389*** 0.391***
## (0.009) (0.015) (0.015) (0.015) (0.015)
## density 24.728*** 23.671*** 38.217*** 33.251***
## (6.079) (6.074) (5.926) (6.234)
## chlorides -2.382*** -1.300* -1.370*
## (0.558) (0.542) (0.543)
## volatile.acidity -2.043*** -2.070***
## (0.111) (0.111)
## total.sulfur.dioxide 0.001*
## (0.000)
## ----------------------------------------------------------------------------------------------
## R-squared 0.190 0.192 0.195 0.248 0.249
## adj. R-squared 0.190 0.192 0.195 0.247 0.248
## sigma 0.797 0.796 0.795 0.768 0.768
## F 1146.395 583.290 396.315 402.956 324.034
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5831.127 -5822.011 -5657.292 -5654.027
## Deviance 3112.257 3101.773 3090.247 2889.234 2885.385
## AIC 11684.782 11670.255 11654.021 11326.584 11322.054
## BIC 11704.272 11696.241 11686.504 11365.563 11367.530
## N 4898 4898 4898 4898 4898
## ==============================================================================================
This is not a good model for predicting quality: the R-squared values range from 0.190 to 0.249, so even the largest model explains only about a quarter of the variance in quality.
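The sequence of nested models in the table can be refitted with base R's `lm`, as sketched below. The formulas drop the `I()` wrappers, which does not change the fit; `fit_quality_models` is a hypothetical helper name and the table-printing step used in the report is not reproduced.

```r
# Sketch: fit the five nested linear models and collect their R-squared.
fit_quality_models <- function(wine) {
  formulas <- list(
    m1 = quality ~ alcohol,
    m2 = quality ~ alcohol + density,
    m3 = quality ~ alcohol + density + chlorides,
    m4 = quality ~ alcohol + density + chlorides + volatile.acidity,
    m5 = quality ~ alcohol + density + chlorides + volatile.acidity +
      total.sulfur.dioxide
  )
  models <- lapply(formulas, lm, data = wine)   # one lm() fit per formula
  sapply(models, function(m) summary(m)$r.squared)
}

# Example call, assuming a data frame `wine`:
# fit_quality_models(wine)
```

Because each model adds a predictor to the previous one, R-squared can only stay the same or increase, so the small gains from m1 to m5 are a better guide to usefulness than the raw values.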
Both residual sugar and alcohol are strongly correlated with density. Thus these three variables were investigated. Holding residual sugar constant, we saw that as the alcohol content increases, the density decreases. This may be because wine is denser than ethanol.
Next we looked at how residual sugar and chlorides affect density. A new variable ‘density_bucket’ was created to do this analysis. It was found that the higher the content of either chlorides or residual sugar, the higher the wine density. This is because dissolved solids add mass to the solution with comparatively little increase in volume.
Alcohol content and density were highly correlated with each other (Pearson’s r of -0.78), and these variables were also the most correlated with quality. Thus alcohol content and density were investigated together with the ‘level’ variable created earlier, which is the wine quality level of ‘bad’, ‘average’ or ‘excellent’. It was found that higher quality wines seem to have a higher alcohol content and a lower density.
It was interesting that alcohol and density seemed to follow opposite trends: when the alcohol content decreased, the density increased and vice versa. This may be because wine is denser than ethanol.
A linear model was constructed for quality; however, this model was not a good fit, with R-squared values ranging from 0.190 to 0.249. This may be because the model itself is inaccurate or perhaps because these variables vary across quality, so that a general model to predict wine quality is not possible. It may also be that there are variables that affect wine quality that were not present in the dataset and are perhaps not easily quantifiable, such as smell.
A bar graph was plotted for the quality of wine, which was categorized as follows:
- Bad: 0 - 4
- Average: 5 - 6
- Excellent: 7 - 10
This stacked bar graph shows that the majority of the white wine is of average quality; there are very few bad quality wines in this dataset. The mode is 6, as it is the most frequently occurring quality value (the olive green portion is the largest). The median is also 6 and the mean is very close at 5.88, which suggests that the quality variable is approximately normally distributed.
The correlation coefficients between the variables were evaluated and the following was found:
Here we saw that a few variables are negatively correlated with wine quality:
- density (r = -0.31)
- chlorides (r = -0.21)
- volatile acidity (r = -0.19)
- total sulfur dioxide (r = -0.18)
- fixed acidity (r = -0.11)
- residual sugar (r = -0.1)
The following variables are positively correlated with wine quality:
- alcohol (r = 0.44)
- pH (r = 0.1)
- sulphates (r = 0.05)
The other correlations that stand out from the correlation matrix are:
- residual sugar and density (r = 0.84)
- alcohol and density (r = -0.78)
- free sulfur dioxide and total sulfur dioxide (r = 0.62)
- density and total sulfur dioxide (r = 0.53)
- alcohol and residual sugar (r = -0.45)
For the boxplots for alcohol vs quality: For bad quality wine, the alcohol content is just over 10 % by volume, then for average quality of 5, the alcohol content decreases and then increases from quality 6 to excellent quality (7 - 9).
For the boxplots for density vs quality: The median density is fairly steady for bad quality wines (just under 0.995), then increases slightly for wine quality level of 5 and then gradually decreases as the quality level increases from average to excellent.
It’s interesting that alcohol and density seem to follow opposite trends against quality: when the alcohol content decreases, we see the density increase. The data points are overlaid on both graphs in black and jittered for visibility. Here we see that there is more data available for the average quality wines.
The residual sugar is very strongly positively correlated with density, with a correlation coefficient of 0.84. A red linear trendline is plotted for the data.
Alcohol is strongly negatively correlated with density, with a correlation coefficient of -0.78.
These variables were the most strongly correlated in the white wine dataset.
Both residual sugar and alcohol are strongly correlated with density. Thus these three variables were investigated.
Holding residual sugar constant, we can see that as the alcohol content increases, the density decreases. This may be because wine is denser than ethanol. The density of ethanol is 0.789 g/cm^3 at 20 degrees Celsius [8], while the density of white wine is around 0.98 g/cm^3 [9]. The higher the alcohol volume percentage, the lower the volume percentage of the rest of the wine, so the density of the solution decreases because ethanol is less dense than white wine.
Alcohol content and density were highly correlated with each other (Pearson’s r of -0.78), and these variables were also the most correlated with quality. Thus alcohol content and density were investigated together with the ‘level’ variable created earlier, which is the wine quality level of ‘bad’, ‘average’ or ‘excellent’.
The first plot is a scatter plot with an overlay of contour lines by quality level. The second plot contains the same variables but is a 2D density plot. The two plots above seem to show the same relationship. For average quality wine we cannot see the contour clearly, but in the 2D density plot we see a peak at alcohol ~ 9.4 vol% and density ~ 0.45 g/cm^3. What is interesting are the three peaks for excellent quality wine: we see the three contours in the first plot, which is corroborated by the peaks in the second plot. High quality wines seem to have a higher alcohol content and a lower density, and in general the peaks are not as high for excellent quality wine as for average and bad quality wine. This further supports the discussion from Plot 5.
From the White Wine Quality Analysis the following was concluded:
Alcohol content seemed to have the greatest effect on wine quality (the higher the alcohol content, the higher the wine quality). Residual sugar, which I initially expected to matter most, had very little effect on wine quality. Average and excellent quality wines seemed to have a slightly higher free and total sulfur dioxide content, which adds to the freshness of the wine; in excess, however, SO2 can produce a bad odour or taste and act as an allergen.
From this investigation, it seems that predicting the wine quality based on these chemical properties proved to be challenging (the linear model was not a good predictor for wine quality). There may be other variables that were not quantified in this dataset such as grape type, climate, temperature, sunlight, soil, levels of tannins in the wine, aging process and so on that may have a greater effect on wine quality.
In future, it may be better to explore these trends by quality level. This dataset contained mainly average quality wine, so it would also be beneficial to have a larger dataset with more bad and excellent quality wines.
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236
[2] https://stackoverflow.com/questions/6557977/how-do-i-add-the-mean-value-to-a-histogram-in-r/6558014#6558014 (add mean line to graph)
[3] http://www.talkstats.com/threads/adding-a-new-column-in-r-data-frame-with-values-conditional-on-another-column.30924/ (ifelse statement, creating level column)
[4] https://stackoverflow.com/questions/24895575/ggplot2-bar-plot-with-two-categorical-variables (stacked bar chart)
[5] https://winobrothers.com/2011/10/11/sulfur-dioxide-so2-in-wine/
[6] http://www.sthda.com/english/wiki/ggcorrplot-visualization-of-a-correlation-matrix-using-ggplot2 (correlation matrix)
[7] https://www.r-graph-gallery.com/264-control-ggplot2-boxplot-colors/ (fill histogram levels)
[8] https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Ethanol_(data_page).html (ethanol density)
[9] https://stats.stackexchange.com/questions/31726/scatterplot-with-contour-heat-overlay (scatterplot with contour overlay)
[10] https://stackoverflow.com/questions/23675735/how-to-add-boxplots-to-scatterplot-with-jitter (add scatterplot to boxplot)
[11] https://stackoverflow.com/questions/12980081/create-a-stacked-density-graph-in-ggplot2 (stacked density graph)
[12] http://petewerner.blogspot.co.za/2012/12/density-plot-with-ggplot.html (stacked density graph)